CONTINUUM ARMED BANDIT PROBLEM OF FEW VARIABLES IN HIGH DIMENSIONS

HEMANT TYAGI AND BERND GÄRTNER

Abstract. We consider the stochastic and adversarial settings of continuum armed bandits where the arms are indexed by $[0,1]^d$. The reward functions $r : [0,1]^d \to \mathbb{R}$ are assumed to intrinsically depend on at most $k$ coordinate variables, implying $r(x_1,\dots,x_d) = g(x_{i_1},\dots,x_{i_k})$ for distinct and unknown $i_1,\dots,i_k \in \{1,\dots,d\}$ and some locally Hölder continuous $g : [0,1]^k \to \mathbb{R}$ with exponent $\alpha \in (0,1]$. We consider the setting where $(i_1,\dots,i_k)$ is fixed across time. We propose a simple modification of the CAB1 algorithm where we construct the discrete set of points to obtain a bound of $O(n^{\frac{\alpha+k}{2\alpha+k}} (\log n)^{\frac{\alpha}{2\alpha+k}} C(k,d))$ on the regret, with $C(k,d)$ depending at most polynomially on $k$ and sub-logarithmically on $d$. The construction is based on creating partitions of $\{1,\dots,d\}$ and is probabilistic, hence our result holds with high probability.

1. Introduction

In online decision making problems, a player is required to play a strategy, chosen from a given set of

strategies $S$, over a period of $n$ trials or rounds. Each strategy has a reward associated with it, specified by a reward function $r : S \to \mathbb{R}$ which typically changes across time in a manner initially unknown to the player. The aim of the player is to choose the strategies so that the total expected reward of the chosen strategies is maximized. There are two main types of online decision making problems, depending on the type of feedback the player receives at the end of each round.

(1) Best expert problem. In this problem the entire reward function is revealed to the algorithm as feedback at the end of each round.

(2) Multi-armed bandit problem. In this problem the algorithm only receives the reward associated with the strategy that was played in the round.

There are many applications of online decision making problems such as routing [1] [2], wireless networks [3], online auction mechanisms [4] [5], statistics (sequential design of experiments [5]) and economics (pricing [7]), to name a few. The performance of algorithms in online decision problems is measured in terms of their regret. This is defined as the difference between the total expected reward of the best constant (i.e. not varying with time) strategy and the expected reward of the sequence of strategies played by the player.

Multi-armed bandit problems have been studied extensively when the strategy set S is finite and optimal 
regret bounds are known within a constant factor JHHH1IH]- On the other hand, the setting in which S consists 
of infinitely many arms has been an area of recent attention both from the point of view of the interesting 
nature of the problem and also due to its practical significance. Such problems are referred to as continuum 
armed bandit problems and are the focus of this paper. Usually S is considered to be a compact subset of 
a metric space such as $\mathbb{R}^d$. Some applications of these problems are in: (i) online auction mechanism design [4] [5], where the set of feasible prices is representable as an interval, and (ii) online oblivious routing [2], where $S$ is a flow polytope.

For a $d$-dimensional strategy space, if the only assumption made on the reward functions is on their degree of smoothness, then any multi-armed bandit algorithm will incur worst-case regret which depends exponentially on $d$. To see this, let $S = [-1,1]^d$ and say the reward function does not vary with time and is zero in all but one orthant $\mathcal{O}$ of $S$, where it takes a positive value somewhere. Clearly any algorithm would need in expectation $\Omega(2^d)$ trials before it can sample a point from $\mathcal{O}$, thus implying a bound of $\Omega(2^d)$ on the regret incurred. To circumvent this curse of dimensionality, the reward functions are typically considered to be linear

(see for example PHIE]) or convex (see for example [T^JUS]), for which the regret is polynomial in $d$ and sub-linear in $n$. We consider the setting where the reward function $r : [0,1]^d \to \mathbb{R}$ depends on only an unknown subset of $k$ out of the $d$ coordinate variables, implying $r(x_1,\dots,x_d) = g(x_{i_1},\dots,x_{i_k})$. The environment is allowed to sample the underlying function $g$ either in an i.i.d. manner from some fixed underlying distribution (stochastic) or arbitrarily (adversarial). To the best of our knowledge, such a structure for reward functions has not been considered in the bandit setting previously. On the other hand, there has been significant effort in other fields to develop tractable algorithms for approximating $d$-variate functions (with $d$ large) from point queries by assuming the functions to intrinsically depend on only a few variables or parameters (see [II H3 [H El [TS] and references within).

Related Work. The continuum armed bandit problem was first introduced in [19] for the case $d = 1$, where an algorithm achieving a regret bound of $o(n^{\frac{2\alpha+1}{3\alpha+1}+\eta})$ for any $\eta > 0$ was proposed for locally Hölder continuous mean reward functions with exponent $\alpha \in (0,1]$. In [5] a lower bound of $\Omega(n^{1/2})$ was proven for this problem. This was then improved upon in [13], where the author derived upper and lower bounds of $O(n^{\frac{\alpha+1}{2\alpha+1}} (\log n)^{\frac{\alpha}{2\alpha+1}})$ and $\Omega(n^{\frac{\alpha+1}{2\alpha+1}})$ respectively. In [20] the author considered a class of mean reward functions defined over a compact convex subset of $\mathbb{R}^d$ which (i) have a unique maximum $\mathbf{x}^*$, (ii) are three times continuously differentiable and (iii) have gradients that are well behaved near $\mathbf{x}^*$. It was shown that a modified version of the Kiefer-Wolfowitz algorithm achieves a regret bound of $O(n^{1/2})$, which is also optimal. In [21] the $d = 1$ case was treated, with the mean reward function assumed to only satisfy a local Hölder condition around the maximum $x^*$ with exponent $\alpha \in (0,\infty)$. Under these assumptions the authors considered a modification of Kleinberg's algorithm [13] and achieved a regret bound of $O(n^{\frac{1+\alpha-\alpha\beta}{1+2\alpha-\alpha\beta}} (\log n)^{\frac{\alpha}{1+2\alpha-\alpha\beta}})$, where $\beta$ is such that the Lebesgue measure of $\epsilon$-optimal states is $O(\epsilon^{\beta})$.

In [22] [23] the authors studied a very general setting for the multi-armed bandit problem in which $S$ forms a metric space, with the reward function assumed to satisfy a Lipschitz condition with respect to this metric. In particular it was shown in [23] that if $S = [0,1]^d$ and the mean reward function satisfies a local Hölder condition with exponent $\alpha \in (0,\infty)$ around any maximum $\mathbf{x}^*$ (the number of maxima assumed to be finite), then their algorithm achieves a rate of growth of regret which is $\tilde{O}(\sqrt{n})$¹ and hence independent of the dimension $d$.²

Our Contribution. We prove that when the $k$-tuple $(i_1,\dots,i_k)$ is kept fixed across time but is unknown to the player, a simple modification of the CAB1 algorithm can be used to achieve a regret bound of $O(n^{\frac{\alpha+k}{2\alpha+k}} (\log n)^{\frac{\alpha}{2\alpha+k}} C(k,d))$, where $\alpha \in (0,1]$ denotes the exponent of Hölder continuity of the reward functions. The additional factor $C(k,d)$, which depends at most polynomially on $k$ and sub-logarithmically on $d$, captures the uncertainty of not knowing the $k$ active coordinates. The modification is in the manner of discretization of $[0,1]^d$, for which we consider a probabilistic construction based on creating partitions of $\{1,\dots,d\}$ into $k$ disjoint subsets. Furthermore, the above bound holds for both the stochastic (underlying $g$ sampled in an i.i.d. manner) and the adversarial (underlying $g$ chosen arbitrarily at each round) models.

Organization of the paper. The rest of the paper is organized as follows. In Section 2 we define the problem statement formally and outline our main results. Section 3 contains the analysis for the case $k = 1$ for both the stochastic and adversarial models. In Section 4 we present an analysis for the general setting where $1 \le k \le d$, including the construction of the discrete strategy sets. Finally, in Section 5 we summarize our results and provide directions for future work.

2. Problem Setup and Notation 

We consider that a compact set of strategies $S = [0,1]^d \subset \mathbb{R}^d$ is available to the player. Time steps are denoted by $\{1,\dots,n\}$. At each time step $t \in \{1,\dots,n\}$, a reward function $r_t : S \to \mathbb{R}$ is chosen by the environment. Upon playing a strategy $\mathbf{x}_t \in [0,1]^d$, the player receives the reward $r_t(\mathbf{x}_t)$ at time step $t$.

Assumption 1. For some $k \le d$, we assume each $r_t$ to have the low dimensional structure

$$r_t(x_1,\dots,x_d) = g_t(x_{i_1},\dots,x_{i_k}) \qquad (2.1)$$

where $(i_1,\dots,i_k)$ is a $k$-tuple with distinct integers $i_j \in \{1,\dots,d\}$ and $g_t : [0,1]^k \to \mathbb{R}$.
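The structure in Assumption 1 can be illustrated with a small sketch. The helper `make_sparse_reward`, the particular function `g`, and the choice $d = 6$, $k = 2$ with 0-based active coordinates (2, 5) are hypothetical and serve only as an example; they are not taken from the paper.

```python
import numpy as np

def make_sparse_reward(g, active):
    """Wrap a k-variate function g as a d-variate reward r(x) = g(x[active])."""
    idx = np.asarray(active, dtype=int)
    if len(set(idx.tolist())) != len(idx):
        raise ValueError("active coordinates must be distinct")

    def r(x):
        # Only the coordinates listed in `active` influence the reward.
        return g(np.asarray(x, dtype=float)[idx])

    return r

# Hypothetical example: d = 6, k = 2, active coordinates (2, 5) (0-based).
g = lambda u: 1.0 - 0.5 * np.sqrt(np.abs(u[0] - 0.3)) - 0.5 * np.abs(u[1] - 0.7)
r = make_sparse_reward(g, active=(2, 5))

x = np.array([0.9, 0.1, 0.3, 0.4, 0.8, 0.7])
y = x.copy()
y[[0, 1, 3, 4]] = [0.0, 1.0, 0.5, 0.2]   # change only the inactive coordinates
```

Here `r(x)` and `r(y)` coincide because `x` and `y` agree on the two active coordinates, which is exactly what (2.1) expresses.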



1 $u = \tilde{O}(v)$ means $u = O(v)$ up to a logarithmic factor.

2 Note that there is still a factor that is exponential in $d$ in the regret bound.
In other words, the reward at each time $t$ depends on an unknown subset of $k$ out of the $d$ coordinate variables. For simplicity of notation, we denote the set of $k$-tuples of the set $\{1,\dots,d\}$ by $\mathcal{T}_k^d$. We assume that $k$ is known to the player; however, it suffices to know a bound for $k$ as well. The second assumption that we make is on the smoothness property of the reward functions.

Definition 1. A function $f : [0,1]^k \to \mathbb{R}$ is locally uniformly Hölder continuous with constant $0 \le L < \infty$, exponent $0 < \alpha \le 1$, and restriction $\delta > 0$ if we have for all $\mathbf{u}, \mathbf{u}' \in [0,1]^k$ with $\|\mathbf{u} - \mathbf{u}'\|_2 \le \delta$ that

$$|f(\mathbf{u}) - f(\mathbf{u}')| \le L \|\mathbf{u} - \mathbf{u}'\|_2^{\alpha}. \qquad (2.2)$$

We denote the class of such functions $f$ as $C(\alpha, L, \delta, k)$.
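As a numerical illustration of Definition 1, the following sketch samples random pairs of points at distance at most $\delta$ and counts violations of (2.2). The function name `holder_violations`, the test function `f`, and all parameter values are assumptions made for this example; a zero count is only empirical evidence, not a proof of membership in $C(\alpha, L, \delta, k)$.

```python
import numpy as np

def holder_violations(f, k, alpha, L, delta, n_pairs=20000, seed=0):
    """Count sampled pairs (u, v) with ||u - v||_2 <= delta that violate (2.2)."""
    rng = np.random.default_rng(seed)
    u = rng.random((n_pairs, k))
    # Random step of Euclidean norm at most delta; clipping to the cube
    # cannot increase the distance to u, since u already lies in the cube.
    step = rng.normal(size=(n_pairs, k))
    radius = delta * rng.random((n_pairs, 1))
    step *= radius / np.linalg.norm(step, axis=1, keepdims=True)
    v = np.clip(u + step, 0.0, 1.0)
    dist = np.linalg.norm(u - v, axis=1)
    gap = np.abs(f(u) - f(v))
    return int(np.sum(gap > L * dist**alpha + 1e-12))

# Hypothetical test function: Hoelder with exponent 1/2 and constant 1,
# but not Lipschitz (exponent 1) near u_0 = 0.5.
f = lambda U: np.sqrt(np.abs(U[:, 0] - 0.5))

n_bad_half = holder_violations(f, k=2, alpha=0.5, L=1.0, delta=0.1)
n_bad_one = holder_violations(f, k=2, alpha=1.0, L=1.0, delta=0.1)
```

With these choices `n_bad_half` is zero, while `n_bad_one` is positive because the square root has unbounded slope at $u_0 = 0.5$.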

The function class defined in Definition 1 was also considered in [19] [13]. We now define the two models that we analyze in this paper. These models essentially describe how the reward functions $g_t$ are generated at each time step $t$.

• Stochastic model. In this model, the reward functions $g_t$ are considered to be i.i.d. samples from some fixed probability distribution over functions $g : [0,1]^k \to \mathbb{R}$. We define the expectation of the reward function as $\bar{g}(\mathbf{u}) = E[g(\mathbf{u})]$ where $\mathbf{u} \in [0,1]^k$. We require $\bar{g}$ to belong to $C(\alpha, L, \delta, k)$ and note that the individual samples $g_t$ need not necessarily be Hölder continuous. Lastly we make the following assumption on the distribution from which the random samples $g$ are generated.

Assumption 2. We assume that there exist constants $\zeta, s_0 > 0$ so that³

$$E[e^{s(g(\mathbf{u}) - \bar{g}(\mathbf{u}))}] \le e^{\frac{1}{2}\zeta^2 s^2} \quad \forall s \in [-s_0, s_0],\ \mathbf{u} \in [0,1]^k.$$

The above assumption was considered in [24] for the $d = 1$ situation and essentially weakens the requirement that the range of the functions be limited to $[0,1]$ by allowing the reward functions to take values in $\mathbb{R}$. Note that $\bar{g}$ is still continuous and hence bounded, as it is defined over a compact domain. The optimal strategy $\mathbf{x}^*$ is then defined as follows.

$$\mathbf{x}^* := \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} E[r(\mathbf{x})] = \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} E[g(x_{i_1},\dots,x_{i_k})]. \qquad (2.3)$$

• Adversarial model. In this model we consider the reward functions $g_t : [0,1]^k \to [0,1]$ to be chosen arbitrarily by an oblivious adversary, i.e., an adversary oblivious to the actions of the player. Note that, unlike the stochastic model, each $g_t$ is now required to belong to $C(\alpha, L, \delta, k)$ and also has range limited to $[0,1]$. The optimal strategy $\mathbf{x}^*$ is then defined as follows.

$$\mathbf{x}^* := \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} \sum_{t=1}^{n} r_t(\mathbf{x}) = \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} \sum_{t=1}^{n} g_t(x_{i_1},\dots,x_{i_k}). \qquad (2.4)$$

Given the above models we measure the performance of a player over $T$ rounds in terms of the regret defined as

$$R(T) := \sum_{t=1}^{T} E[r_t(\mathbf{x}^*) - r_t(\mathbf{x}_t)] = \sum_{t=1}^{T} E[g_t(x^*_{i_1},\dots,x^*_{i_k}) - g_t(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})]. \qquad (2.5)$$

In (2.5) the expectation is defined over the random choices of $g_t$ for the stochastic model and the random choice of the strategy $\mathbf{x}_t$ for the adversarial model, respectively, at each time $t$.

Lower bound for $d$-variate reward functions. As discussed in the introduction, for a $d$-dimensional strategy space, if the only assumption made on the reward functions is of smoothness or boundedness, then any multi-armed bandit algorithm must suffer a worst-case regret of $\Omega(2^d)$. More formally, [25] showed a precise exponential lower bound for stochastic continuum armed bandits where the mean reward function $f : [0,1]^d \to [0,1]$ is globally $L$-Lipschitz continuous w.r.t. the $\ell_\infty$ norm.
3 It is easily verifiable that this assumption on $g$ in turn implies the existence of some positive constants $\zeta, s_0$ so that $E[e^{s(r(\mathbf{x}) - \bar{r}(\mathbf{x}))}] \le e^{\frac{1}{2}\zeta^2 s^2}$ for all $s \in [-s_0, s_0]$ and $\mathbf{x} \in [0,1]^d$, where $\bar{r}(\mathbf{x}) = E[r(\mathbf{x})]$.
Theorem 1 (Bubeck et al. [25]). For all strategies of the player and for all

$$n \ge \max\left\{ L^d,\ \left( \frac{0.15\, L^{2/(d+2)}}{\max\{d,2\}} \right)^{d} \right\}$$

the worst-case regret over the set $\mathcal{F}_L$ of all environments that are globally $L$-Lipschitz continuous w.r.t. the $\ell_\infty$ norm satisfies

$$\sup_{\mathcal{F}_L} R(n) \ge 0.15\, L^{d/(d+2)}\, n^{\frac{d+1}{d+2}}.$$

In our setting the reward functions are assumed to intrinsically depend on a subset of $k$ coordinate variables. Therefore it is clear that the worst-case regret incurred by any algorithm would be $\Omega(2^k)$, which is still mild if $k \ll d$. In particular, from Theorem 1 we would have for any algorithm that after $n = \Omega(2^k)$ plays the regret incurred in the worst case is $\Omega(n^{\frac{k+1}{k+2}})$. Also note that one would still expect to incur an extra factor depending on the dimension $d$ in the regret term. This factor would account for the uncertainty in not knowing which $k$ coordinates are active from $\{1,\dots,d\}$.

Main result. The main result of our work is as follows. We provide in the form of Theorem 2 a bound on the regret which is $O(n^{\frac{\alpha+k}{2\alpha+k}} (\log n)^{\frac{\alpha}{2\alpha+k}})$ with some additional terms which are at most polynomial in $k$ and sub-logarithmic in $d$. This bound holds for both the stochastic and adversarial models and is under the assumption that the $k$-tuple $(i_1,\dots,i_k) \in \mathcal{T}_k^d$ is chosen once at the beginning of play and kept fixed thereafter. The algorithm CAB(d,k) which achieves this bound is described in Section 4.⁴

Theorem 2. Given that the $k$-tuple $(i_1,\dots,i_k) \in \mathcal{T}_k^d$ is kept fixed across time but unknown to the player, the algorithm CAB(d,k) incurs a regret

O (n^+^ (log n) 2 ^ k 2 <- 2a + k ) e 2 ^ (log d) 2 ^ k ~ J 

after n rounds of play with high probability for both the stochastic and adversarial models. 

The above result is stated as Lemma 4 for the stochastic setting and as Lemma 5 for the adversarial setting, in Sections 4.1 and 4.2 respectively. The main idea here is to discretize $[0,1]^d$ by first constructing a family of partitions $\mathcal{A}$ of $\{1,\dots,d\}$, with each partition consisting of $k$ disjoint subsets. The construction is probabilistic and the resulting $\mathcal{A}$ satisfies an important property (with high probability), namely the Partition Assumption as described in Section 4. In particular we have that $|\mathcal{A}|$ is $O(k e^k \log d)$, resulting in a total of $M^k |\mathcal{A}|$ points for some integer $M > 0$. The discrete strategy set is then used with a finite armed bandit algorithm, such as UCB-1 [9] for the stochastic setting and Exp3 [8] for the adversarial setting, to achieve the regret bound of Theorem 2.

3. Analysis for case k = 1 

We first consider the relatively simple case when k = 1 as the analysis provides intuition about the more 
general setting where k > 1. 

3.1. Stochastic setting for the case k = 1. The functions $(r_t)_{t=1}^{n}$ are independent samples from a fixed unknown probability distribution on functions $r : [0,1]^d \to \mathbb{R}$, where $r(\mathbf{x}) = g(x_i)$, $\mathbf{x} = (x_1,\dots,x_d)$. Here the unknown coordinate $i$ is also assumed to be drawn independently from a fixed probability distribution on $\{1,\dots,d\}$. Thus at each $t$, the environment first randomly samples $g_t$ from some fixed probability distribution on functions $g : [0,1] \to \mathbb{R}$ and then independently samples the active coordinate $i_t$ from some distribution on $\{1,\dots,d\}$. Note that the player has no knowledge of $g_t$ and $i_t$ at any time $t$; only the value of $r_t$ can be queried at some $\mathbf{x}_t \in [0,1]^d$ for all $t = 1,\dots,n$.

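By plugging in $k = 1$ in (2.3) we obtain the following definition of the optimal constant strategy $\mathbf{x}^*$.

$$\mathbf{x}^* := \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} E[r(\mathbf{x})] = \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} E_{g,i}[g(x_i)] = \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} E_i[\bar{g}(x_i)]. \qquad (3.1)$$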

The player proceeds by discretization of $[0,1]^d$. On account of the low dimensional structure of the reward functions $r_t$, we consider the region

$$V := \{(x,\dots,x) \in \mathbb{R}^d : x \in [0,1]\}$$

and focus on its discretization. Hence the player samples equi-spaced points from $V$ and then runs a finite armed bandit algorithm on the sampled points. Given that the player sampled $M$ points, we denote the set of sampled points by

$$V_M := \left\{ \mathbf{p}_j \in [0,1]^d : \mathbf{p}_j = \frac{j}{M}(1,\dots,1)^T \right\}_{j=1}^{M}. \qquad (3.2)$$

4 The difference between CAB(d,k) and CAB1 lies in the manner of discretization of the region $[0,1]^d$. In our setting, we consider a probabilistic construction of the discrete set of points based on partitions of $\{1,\dots,d\}$, unlike the equi-spaced sampling done in CAB1.
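A minimal sketch of the set of sampled diagonal points in (3.2) follows. The function name `diagonal_grid` and the values of `M` and `d` are hypothetical choices for illustration.

```python
import numpy as np

def diagonal_grid(M, d):
    """Return the M points p_j = (j/M) * (1, ..., 1) in [0, 1]^d, for j = 1..M."""
    levels = np.arange(1, M + 1) / M            # j/M for j = 1, ..., M
    return np.repeat(levels[:, None], d, axis=1)

grid = diagonal_grid(M=4, d=3)                  # rows are the points p_1, ..., p_4
```

Every coordinate of a row equals $j/M$, so a reward of the form $g_t(x_i)$ evaluates to $g_t(j/M)$ on $\mathbf{p}_j$ no matter which coordinate $i$ is active.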

Note that $r_t(\mathbf{p}_j) = g_t(j/M)$ for $\mathbf{p}_j \in V_M$. Over a play of $T$ rounds, the regret $R(T)$ is defined as

$$R(T) := E\Big[\sum_{t=1}^{T} r_t(\mathbf{x}^*) - r_t(\mathbf{x}_t)\Big] = E\Big[\sum_{t=1}^{T} g_t(x^*_{i_t}) - g_t(x_t)\Big] = E\Big[\sum_{t=1}^{T} \bar{g}(x^*_{i_t}) - \bar{g}(x_t)\Big] \qquad (3.3)$$

where $\mathbf{x}_t = (x_t,\dots,x_t) \in V_M$ and the expectation in the rightmost term is over the random choice $i_t$ at each $t$. We now state in Lemma 1 a bound on the regret $R(T)$ incurred by a player with knowledge of the time horizon $T$ who leverages the UCB-1 algorithm [9] on the finite strategy set $V_M$.

Lemma 1. Given the above setting and assuming that the UCB-1 algorithm is used along with the strategy set $V_M$, we have for $M = \lceil (T/\log T)^{\frac{1}{2\alpha+1}} \rceil$ that $R(T) = O(T^{\frac{\alpha+1}{2\alpha+1}} \log^{\frac{\alpha}{2\alpha+1}}(T))$.

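Proof. Firstly we note that for some $\mathbf{x}' = (x',\dots,x') \in V_M$, $R(T)$ can be written as

$$R(T) = E\Big[\sum_{t=1}^{T} \bar{g}(x^*_{i_t}) - \bar{g}(x')\Big] + E\Big[\sum_{t=1}^{T} \bar{g}(x') - \bar{g}(x_t)\Big] = R_1(T) + R_2(T).$$

Now we claim that there exists $x' \in \{1/M,\dots,1\}$ such that $R_1(T) \le T/M^{\alpha}$. To see this, observe that $E_{i_t}[\bar{g}(x^*_{i_t})] \le \bar{g}(x^*_{i_{\max}})$, where

$$i_{\max} = \mathrm{argmax}_{i \in \{1,\dots,d\}} \bar{g}(x^*_i).$$

Since $x^*_{i_{\max}} \in [0,1]$, the claim follows by choosing $x'$ closest to $x^*_{i_{\max}}$. Therefore $R_1(T)$ can be bounded as follows.

$$R_1(T) \le \sum_{t=1}^{T} \big(\bar{g}(x^*_{i_{\max}}) - \bar{g}(x')\big) \le \sum_{t=1}^{T} |x^*_{i_{\max}} - x'|^{\alpha} \le T M^{-\alpha}.$$

It remains to bound $R_2(T)$. Note here that $R_2(T)$ is bounded by the actual regret that would be incurred on the strategy set $\{1/M,\dots,1\}$. Let $\mathbf{x}_{\ast} \in V_M$ be optimal so that

$$\mathbf{x}_{\ast} = \mathrm{argmax}_{\mathbf{x} \in V_M} \bar{r}(\mathbf{x}) = (x_{\ast},\dots,x_{\ast})$$

where $x_{\ast} = \mathrm{argmax}_{x \in \{1/M,\dots,1\}} \bar{g}(x)$. Hence $R_2(T) \le E[\sum_{t=1}^{T} \bar{g}(x_{\ast}) - \bar{g}(x_t)]$. On account of Assumption 2 it can be shown that $R_2(T) = O(\sqrt{MT \log T})$ for the UCB-1 algorithm for $M$-armed stochastic bandits (see Theorem 3.1, [13]). Combining the bounds for $R_1(T)$ and $R_2(T)$ we then have that

$$R(T) = O\left(\sqrt{MT \log T} + \frac{T}{M^{\alpha}}\right). \qquad (3.4)$$

Finally, by plugging $M = \lceil (T/\log T)^{\frac{1}{2\alpha+1}} \rceil$ in (3.4) we obtain $R(T) = O(T^{\frac{\alpha+1}{2\alpha+1}} \log^{\frac{\alpha}{2\alpha+1}} T)$. □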
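As a quick check of this choice of $M$ (a worked sketch with constants and the ceiling suppressed), balancing the two terms of (3.4) gives the stated value of $M$ and the stated rate:

```latex
\begin{align*}
\sqrt{M T \log T} = \frac{T}{M^{\alpha}}
  &\iff M^{\alpha + 1/2} = \left(\frac{T}{\log T}\right)^{1/2}
   \iff M = \left(\frac{T}{\log T}\right)^{\frac{1}{2\alpha+1}}, \\
\frac{T}{M^{\alpha}}
  &= T \left(\frac{\log T}{T}\right)^{\frac{\alpha}{2\alpha+1}}
   = T^{\frac{\alpha+1}{2\alpha+1}} (\log T)^{\frac{\alpha}{2\alpha+1}} .
\end{align*}
```

For example, $\alpha = 1$ yields $M = (T/\log T)^{1/3}$ and a regret of order $T^{2/3} (\log T)^{1/3}$.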

Then using the strategy set $V_M$ with $M$ defined as above, we can employ Algorithm 1 with the MAB sub-routine being the UCB-1 algorithm.

The above algorithm basically employs the discrete set $V_M$ with Algorithm CAB1, proposed in [13]. The main idea is to split the unknown time horizon into intervals of varying size, with each interval being twice as large as the preceding one (doubling trick). Since the regret over an interval of $T$ time steps
Algorithm 1 Algorithm CAB1(d, 1)

    T = 1
    while T ≤ n do
        M = ⌈(T / log T)^{1/(2α+1)}⌉
        Initialize MAB with V_M := {p_j ∈ [0,1]^d : p_j = (j/M)(1, ..., 1)^T}, j = 1, ..., M
        for t = T, ..., min(2T − 1, n) do
            • get x_t from MAB
            • Play x_t and get r_t(x_t)
            • Feed r_t(x_t) back to MAB
        end for
        T = 2T
    end while
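The loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `UCB1` class is a textbook UCB-1 index policy for rewards in $[0,1]$, the guard for the first phase (where $\log T = 0$) is an added assumption, and `noisy_reward` with its active coordinate is a hypothetical stochastic environment.

```python
import math
import numpy as np

class UCB1:
    """Textbook UCB-1 index policy for a fixed number of arms, rewards in [0, 1]."""
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms, dtype=int)
        self.sums = np.zeros(n_arms)
        self.t = 0

    def select(self):
        self.t += 1
        untried = np.flatnonzero(self.counts == 0)
        if untried.size:                       # play every arm once first
            return int(untried[0])
        means = self.sums / self.counts
        bonus = np.sqrt(2.0 * math.log(self.t) / self.counts)
        return int(np.argmax(means + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def cab1_diagonal(reward_fn, n, d, alpha, rng):
    """Doubling-trick loop of Algorithm 1; returns the list of collected rewards."""
    rewards = []
    T = 1
    while T <= n:
        # Assumption: log(1) = 0, so the first phase uses a single arm.
        M = 1 if T == 1 else math.ceil((T / math.log(T)) ** (1.0 / (2 * alpha + 1)))
        levels = np.arange(1, M + 1) / M
        mab = UCB1(M)                          # fresh MAB instance in every phase
        for t in range(T, min(2 * T - 1, n) + 1):
            arm = mab.select()
            x_t = np.full(d, levels[arm])      # the point (j/M)(1, ..., 1)
            r = reward_fn(x_t, t, rng)
            mab.update(arm, r)
            rewards.append(r)
        T *= 2
    return rewards

def noisy_reward(x, t, rng, active=3):
    """Hypothetical environment: Lipschitz mean reward of one hidden coordinate."""
    mean = 1.0 - abs(x[active] - 0.6)
    return float(np.clip(mean + 0.05 * rng.normal(), 0.0, 1.0))

rng = np.random.default_rng(0)
rewards = cab1_diagonal(noisy_reward, n=2000, d=10, alpha=1.0, rng=rng)
```

The player never needs to know the value of `active`: every played point lies on the diagonal, so the reward depends only on the chosen level $j/M$.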



(with $T$ now known to the player) is $O(T^{\frac{1+\alpha}{1+2\alpha}} \log^{\frac{\alpha}{1+2\alpha}} T)$, we have that the overall regret over $n$ plays is $O(n^{\frac{1+\alpha}{1+2\alpha}} \log^{\frac{\alpha}{1+2\alpha}} n)$. Note that the regret incurred is independent of the dimension $d$ and is also optimal up to a log-factor, since an $\Omega(n^{\frac{1+\alpha}{1+2\alpha}})$ bound is known [13] for the $d = 1$ scenario.

3.2. Adversarial setting for the case k = 1. In this setting, the reward functions $(r_t)_{t=1}^{n}$ are chosen beforehand by an oblivious adversary, i.e., the adversary does not select $r_t$ based on the past sequence of action-response pairs $(\mathbf{x}_1, r_1(\mathbf{x}_1), \mathbf{x}_2, r_2(\mathbf{x}_2), \dots, \mathbf{x}_{t-1}, r_{t-1}(\mathbf{x}_{t-1}))$. The reward function $r_t : [0,1]^d \to [0,1]$ has the low dimensional structure $r_t(\mathbf{x}) = g_t(x_i)$, where $i \in \{1,\dots,d\}$ is chosen once at the beginning of play and kept fixed thereafter. Hence at each $t$ the adversary chooses $g_t \in C(\alpha, L, \delta, 1)$ arbitrarily but oblivious to the plays of the algorithm. By plugging in $k = 1$ in (2.4) we obtain the following definition of the optimal constant strategy $\mathbf{x}^*$.

$$\mathbf{x}^* := \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} \sum_{t=1}^{n} r_t(\mathbf{x}) = \mathrm{argmax}_{\mathbf{x} \in [0,1]^d} \sum_{t=1}^{n} g_t(x_i).$$

The regret incurred by the algorithm over a play of $T$ rounds is defined as

$$R(T) := E\Big[\sum_{t=1}^{T} r_t(\mathbf{x}^*) - r_t(\mathbf{x}_t)\Big] = E\Big[\sum_{t=1}^{T} g_t(x^*_i) - g_t(x^{(t)}_i)\Big].$$

Note that the expectation here is over the random choice of the algorithm at each $t$; the adversary is assumed to be deterministic without loss of generality. To start off, we first consider the scenario where the adversary chooses some active coordinate at the beginning of play and fixes it across time. The bound on the regret $R(T)$ is shown for a player who employs the Exp3 algorithm [8] on the discrete set $V_M$ (defined in Section 3.1) for an appropriate value of the sample size $M$.

Lemma 2. Consider the scenario where the active coordinate is fixed across time, i.e. $i_1 = i_2 = \dots = i$, and unknown to the player. Then by employing the Exp3 algorithm along with the strategy set $V_M$ for $M = \lceil (T/\log T)^{\frac{1}{2\alpha+1}} \rceil$ we have that $R(T) = O(T^{\frac{\alpha+1}{2\alpha+1}} \log^{\frac{\alpha}{2\alpha+1}}(T))$.

Proof. For some $\mathbf{x}' = (x',\dots,x') \in V_M$ we can rewrite $R(T)$ as $R(T) = R_1(T) + R_2(T)$ where

$$R_1(T) = E\Big[\sum_{t=1}^{T} r_t(\mathbf{x}^*) - r_t(\mathbf{x}')\Big] = E\Big[\sum_{t=1}^{T} g_t(x^*_i) - g_t(x')\Big],$$

$$R_2(T) = E\Big[\sum_{t=1}^{T} r_t(\mathbf{x}') - r_t(\mathbf{x}_t)\Big] = E\Big[\sum_{t=1}^{T} g_t(x') - g_t(x_t)\Big].$$

Now as $x^*_i \in [0,1]$, there exists $x' \in \{1/M,\dots,1\}$ closest to $x^*_i$ so that $|x^*_i - x'| \le 1/M$. Choosing this $x'$ we obtain $R_1(T) \le \sum_{t=1}^{T} |x^*_i - x'|^{\alpha} \le T M^{-\alpha}$. Furthermore, we can bound $R_2(T)$ as follows.

$$R_2(T) = E\Big[\sum_{t=1}^{T} g_t(x') - g_t(x_t)\Big] \le E\Big[\sum_{t=1}^{T} g_t(x_{\ast}) - g_t(x_t)\Big]$$

where $x_{\ast} := \mathrm{argmax}_{x \in \{1/M,\dots,1\}} \sum_{t=1}^{T} g_t(x)$. In other words, $x_{\ast}$ is "optimal" out of $\{1/M,\dots,1\}$. Hence by using the Exp3 algorithm on $V_M$ we get $R_2(T) = O(\sqrt{TM \log M})$. Lastly, for $M = \lceil (T/\log T)^{\frac{1}{2\alpha+1}} \rceil$ we have that $R(T) = O(T^{\frac{\alpha+1}{2\alpha+1}} \log^{\frac{\alpha}{2\alpha+1}} T)$. □

Using the strategy set $V_M$ with $M$ defined as above, we can employ Algorithm 1 with the MAB sub-routine being the Exp3 algorithm. As the regret in the inner loop is $O(T^{\frac{1+\alpha}{1+2\alpha}} \log^{\frac{\alpha}{1+2\alpha}} T)$, we have as a consequence of the doubling trick that the overall regret over $n$ plays is $O(n^{\frac{1+\alpha}{1+2\alpha}} \log^{\frac{\alpha}{1+2\alpha}} n)$. We also see that the bound is independent of the dimension $d$.

Remark 1. Another way to look at the above setting, where the active coordinate is constant over time, is the following. Let $x_{\max} \in [0,1]$ be such that $x_{\max} = \mathrm{argmax}_{x \in [0,1]} \sum_{t=1}^{n} g_t(x)$. Then provided that $i_1 = \dots = i_t = i$, we have that $\mathbf{x}^* = x_{\max}(1,\dots,1)^T$ belongs to the set of optimal solutions, i.e. points in $[0,1]^d$ which maximize $\sum_t g_t(x_i)$. Therefore we can play strategies from the set $V_M$ and employ Algorithm 1 to achieve regret independent of $d$.

We also briefly remark on the scenario where the reward function $g_t$ remains unchanged across time, i.e. $g_1 = \dots = g_t = g$, while the active coordinate $i_t$ changes arbitrarily. Denote $x_{\max} = \mathrm{argmax}_{x \in [0,1]} g(x)$. Then clearly $\mathbf{x}^* = x_{\max}(1,\dots,1)^T$ is one of the maximizers (amongst others in $[0,1]^d$) of $\sum_t g(x_{i_t})$. Hence in this case too, we can play strategies from the set $V_M$ and employ Algorithm 1 to achieve regret independent of $d$.

Lower bound on regret for Adversarial setting, k = 1. In Lemma 2 we had considered the setting where the active coordinate remains fixed over time. We now show that if the active coordinate is allowed to change arbitrarily, then the worst-case regret incurred by any algorithm playing strategies only from the region $\{(x,\dots,x) \in \mathbb{R}^d : x \in [0,1]\}$ over $T$ rounds will be $\Omega(T)$. To this end we construct the adversary ADV as follows. Before the start of play, ADV chooses uniformly at random a 2-partition $(A_1, A_2)$ of $\{1,\dots,d\}$, which is then kept fixed across time. Note that there are in total $2^d - 2$ ordered 2-partitions of $\{1,\dots,d\}$ with $A_1, A_2 \ne \emptyset$, where $A_1$ denotes the first subset and $A_2$ the second subset.⁵ At each $t$, ADV then randomly selects w.p. 1/2 either $h_1$ or $h_2$, where

$$h_1(x) = \begin{cases} u(x) ; & x \in [0,1/2] \\ 0 ; & x \in [1/2,1] \end{cases} \qquad \text{and} \qquad h_2(x) = \begin{cases} 0 ; & x \in [0,1/2] \\ v(x) ; & x \in [1/2,1] \end{cases}$$

and sets it as $g_t$. Here $u, v$ are considered to be $C^{\infty}$ functions taking values in $[0,1]$ with supports $[0,1/2]$ and $[1/2,1]$ respectively. Constraining $u(0) = u(1/2) = v(1/2) = v(1) = 0$, we assume $u$ and $v$ to be uniquely maximized at $a \in (0,1/2)$ and $b \in (1/2,1)$ respectively, so that $u(a) = v(b) = 1$. Finally, if $h_1$ was selected then ADV samples the active coordinate $i_t$ uniformly at random from $A_1$, else it samples $i_t$ uniformly at random from $A_2$. We now prove the lower bound on the regret of all algorithms constrained to play strategies from the region $\{(x,\dots,x) \in \mathbb{R}^d : x \in [0,1]\}$ against ADV, formally in Lemma 3.

Lemma 3. Consider the adversarial setting with $k = 1$ where

(1) the adversary is allowed to choose both $g_t \in C(\alpha, L, \delta, 1)$ for some positive constants $\alpha, L, \delta$ and $i_t \in \{1,\dots,d\}$ arbitrarily at each $t$, and

(2) the player is only allowed to play strategies from $V := \{(x,\dots,x) \in \mathbb{R}^d : x \in [0,1]\}$.

We then have that the worst-case regret $R(T)$ of any algorithm over $T$ rounds satisfies $R(T) \ge T/2$.

Proof. We consider the adversary ADV whose construction was described previously. Any randomized playing strategy is equivalent to an a priori random choice from the set of all deterministic strategies. Since ADV is oblivious to the actions of the player, it suffices to consider an upper bound on the expected gain holding true for all deterministic strategies. Furthermore, we consider all expectations to be conditioned on the event that the 2-partition $(A_1, A_2)$ was chosen at the beginning by ADV.

5 For example, for $d = 3$ we have $\{(\{1,2\},\{3\}), (\{3\},\{1,2\}), (\{1,3\},\{2\}), (\{2\},\{1,3\}), (\{2,3\},\{1\}), (\{1\},\{2,3\})\}$ as the set of possible 2-partitions with ordering and with $A_1, A_2 \ne \emptyset$.

Consider any deterministic algorithm ALG. Here ALG : $((\mathbf{x}_1, r_1(\mathbf{x}_1)), (\mathbf{x}_2, r_2(\mathbf{x}_2)), \dots, (\mathbf{x}_{t-1}, r_{t-1}(\mathbf{x}_{t-1}))) \to \mathbf{x}_t$ is a deterministic mapping that maps the past $(t-1)$ plays/responses to a fixed strategy $\mathbf{x}_t$ at time $t$. Therefore, for a point $\mathbf{x} = (x_1,\dots,x_d)$ played at time $t$, we get

$$E[g_t(x_{i_t})] = \frac{1}{2}\left[ \sum_{i \in A_1} \frac{1}{|A_1|} h_1(x_i) + \sum_{i \in A_2} \frac{1}{|A_2|} h_2(x_i) \right].$$

Denoting $a = \mathrm{argmax}_{x \in [0,1]} h_1(x)$ and $b = \mathrm{argmax}_{x \in [0,1]} h_2(x)$, we clearly have that $\mathbf{x}^* \in [0,1]^d$, where

$$x^*_i = \begin{cases} a ; & i \in A_1 \\ b ; & i \in A_2 \end{cases}$$

uniquely maximizes $E[g_t(x_{i_t})]$ for each $t = 1,\dots,T$. In particular,

$$E[g_t(x^*_{i_t})] = \frac{1}{2}\left[ \sum_{i \in A_1} \frac{1}{|A_1|} h_1(a) + \sum_{i \in A_2} \frac{1}{|A_2|} h_2(b) \right] = \frac{1}{2}[h_1(a) + h_2(b)] = 1. \qquad (3.5)$$

Using (3.5) and denoting $\mathbf{x}_t = (x_t,\dots,x_t) \in V$ as the strategy played at time $t$, we have

$$R((A_1, A_2), T) = E\Big[\sum_{t=1}^{T} r_t(\mathbf{x}^*) - r_t(\mathbf{x}_t)\Big] = E\Big[\sum_{t=1}^{T} (1 - g_t(x_t))\Big]$$

where $R((A_1, A_2), T)$ denotes the regret incurred by ALG when $(A_1, A_2)$ was selected as the 2-partition at the beginning by ADV. Finally, since

$$E[g_t(x_t)] = \begin{cases} \frac{1}{2} h_1(x_t) ; & x_t \in [0,1/2] \\ \frac{1}{2} h_2(x_t) ; & x_t \in [1/2,1] \end{cases}$$

we have that $E[g_t(x_t)] \le 1/2$ for any deterministic algorithm at time $t$. Hence $R((A_1, A_2), T) \ge T/2$. □

Let us examine the result of Lemma 3 in detail. We first notice that for the adversary ADV, the optimal strategy $\mathbf{x}^* \in [0,1]^d$ will be such that the indices lying in $A_1$ have the value $a$ while the remaining indices have the value $b$. Also note that here $a \in (0,1/2)$ and $b \in (1/2,1)$. Furthermore, as $(A_1, A_2)$ constitutes a 2-partition of $\{1,\dots,d\}$ with $A_1, A_2 \ne \emptyset$, the optimal solution $\mathbf{x}^*$ will always have two indices such that one has the value $a$ and the other has the value $b$. On the other hand, the region $V := \{(x,\dots,x) \in \mathbb{R}^d : x \in [0,1]\}$ to which the player is confined has the same value in all indices, and so $\mathbf{x}^*$ will never belong to $V$. Hence it is not surprising that the regret incurred by any deterministic algorithm playing strategies only from $V$ against ADV is $\Omega(T)$.
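The gap between diagonal strategies and $\mathbf{x}^*$ can be checked numerically. In the sketch below, the smooth bump used for $u$ and $v$, the dimension $d = 5$, and the partition `(A1, A2)` are hypothetical choices that satisfy the constraints of the construction (with $a = 1/4$ and $b = 3/4$); the expectation over ADV's random choices is computed in closed form rather than simulated.

```python
import numpy as np

def bump(x, lo, hi):
    """Smooth bump supported on [lo, hi] with maximum value 1 at the midpoint."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = (2.0 * x - (lo + hi)) / (hi - lo)      # maps [lo, hi] onto [-1, 1]
    out = np.zeros_like(x)
    inside = np.abs(z) < 1.0
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - z[inside] ** 2))
    return out

h1 = lambda x: bump(x, 0.0, 0.5)   # plays the role of u on [0, 1/2], zero elsewhere
h2 = lambda x: bump(x, 0.5, 1.0)   # plays the role of v on [1/2, 1], zero elsewhere

def expected_reward(x, A1, A2):
    """E[g_t(x_{i_t})] under ADV for a fixed point x and 2-partition (A1, A2)."""
    x = np.asarray(x, dtype=float)
    return 0.5 * (h1(x[A1]).mean() + h2(x[A2]).mean())

d, A1, A2 = 5, [0, 1], [2, 3, 4]   # hypothetical partition (0-based indices)
x_star = np.empty(d)
x_star[A1], x_star[A2] = 0.25, 0.75            # value a on A1, value b on A2

best_on_diagonal = max(
    expected_reward(np.full(d, x), A1, A2) for x in np.linspace(0.0, 1.0, 1001)
)
```

Here `expected_reward(x_star, A1, A2)` equals 1 while `best_on_diagonal` does not exceed 1/2, matching the per-round gap of at least 1/2 used in the proof of Lemma 3.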



4. Analysis for general case, with k active coordinates 

We now consider the general situation where the reward function $r_t$ consists of $k$ active coordinates with $1 \le k \le d$. Note that for our analysis, we assume that $k$ is known to the player. However, knowing a bound for $k$ also suffices. The discretized strategy set that we employ for our purposes here has a more involved structure since $k$ can now be greater than one.

To start off, let $\mathcal{A}$ denote a collection of partitions of $\{1,\dots,d\}$ where each $A \in \mathcal{A}$ consists of $k$ disjoint subsets $(A_1,\dots,A_k)$.

Partition Assumption. We require the family $\mathcal{A}$ to be rich enough so that given any $k$ distinct integers $i_1, i_2, \dots, i_k \in \{1,\dots,d\}$, there is a partition $A$ in $\mathcal{A}$ such that each set in $A$ contains exactly one of $i_1, i_2, \dots, i_k$. This condition is known as perfect hashing in theoretical computer science and is widely used, for instance in finding juntas. Two natural questions that now arise are:

• How large should $\mathcal{A}$ be to guarantee the Partition Assumption?

• How can one construct such a family of partitions $\mathcal{A}$?

For now we assume that we have such a family $\mathcal{A}$ in hand. We will later describe a probabilistic method, as shown in [14] (Section 5), where we will construct $\mathcal{A}$ consisting of $O(k e^k \log d)$ partitions which satisfies the partition assumption with high probability.
Constructing strategy set $V_M$ using $\mathcal{A}$. Say we are given a family of partitions $\mathcal{A}$ satisfying the partition assumption. Then using $\mathcal{A}$ we construct the discrete set of strategies $V_M \subset [0,1]^d$ for some fixed integer $M > 0$ as follows.

$$V_M := \left\{ P = \frac{1}{M} \sum_{i=1}^{k} \alpha_i \chi_{A_i} \ ;\ \alpha_i \in \{1,\dots,M\},\ (A_1,\dots,A_k) \in \mathcal{A} \right\} \subset [0,1]^d \qquad (4.1)$$

Note that a strategy $P = \frac{1}{M} \sum_{i=1}^{k} \alpha_i \chi_{A_i}$ has coordinate value $\frac{1}{M}\alpha_i$ at each of the coordinate indices in $A_i$. Therefore we see that for each partition $A \in \mathcal{A}$ we have $M^k$ strategies, implying a total of $M^k |\mathcal{A}|$ strategies in $V_M$.

Remark 2. Note that due to the above construction, we would have multiple entries of $V_M$ referring to the same $\mathbf{x} \in [0,1]^d$. However, this does not affect us since we will simply be treating each entry of $V_M$ as a separate strategy.

Projection property. An important property of the strategy set $V_M$ is the following. Given any $(j_1,\dots,j_k)$ with distinct $j_q \in \{1,\dots,d\}$ and any integers $1 \le i_1,\dots,i_k \le M$, there is a strategy $P \in V_M$ such that coordinate $j_v$ of $P$ is $i_v/M$ for $v = 1,\dots,k$. To see this, one can simply take a partition $A$ from $\mathcal{A}$ such that each $j_v$ is in a different set $A_i$ of $A$. Then setting $\alpha_i = i_v$ when $j_v \in A_i$, we get that coordinate $j_v$ of $P$ has the value $i_v/M$.
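The construction (4.1) and the projection property can be sketched as follows. The helper names `strategy_set` and `has_projection`, the hand-picked family `family` for $d = 4$, $k = 2$, and the value $M = 3$ are hypothetical and used only for illustration; indices are 0-based.

```python
import itertools
import numpy as np

def strategy_set(partitions, M, d):
    """All points (1/M) * sum_i alpha_i * chi_{A_i}, alpha in {1..M}^k, per partition."""
    points = []
    for blocks in partitions:                       # blocks = (A_1, ..., A_k)
        for alphas in itertools.product(range(1, M + 1), repeat=len(blocks)):
            p = np.zeros(d)
            for a, block in zip(alphas, blocks):
                if block:                           # empty blocks contribute nothing
                    p[list(block)] = a / M          # coordinates in A_i get alpha_i / M
            points.append(p)
    return np.array(points)

def has_projection(points, coords, values):
    """True if some strategy takes the given values on the given coordinates."""
    target = np.asarray(values, dtype=float)
    return bool(np.any(np.all(np.isclose(points[:, list(coords)], target), axis=1)))

# Hypothetical family for d = 4, k = 2 that separates every pair of coordinates.
family = [([0, 1], [2, 3]), ([0, 2], [1, 3]), ([0, 3], [1, 2])]
M, d = 3, 4
V_M = strategy_set(family, M, d)                    # M^k * |family| = 27 entries
```

Duplicate points are kept as separate entries, in line with Remark 2.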

Constructing family of partitions $\mathcal{A}$. We now present a probabilistic construction for the family of partitions $\mathcal{A}$ satisfying the partition assumption discussed previously. This fairly standard method can be found for example in [14] (Section 5), where the authors considered the problem of approximating Lipschitz functions intrinsically depending on a subset of the coordinate variables. Bounds for $|\mathcal{A}|$ can be found in
[261 EI]. 

We are given $d$ and $k$ where $1 \le k \le d$. We are interested in constructing partitions $A = (A_1,\dots,A_k)$
of $\{1,\dots,d\}$ consisting of $k$ disjoint subsets of $\{1,\dots,d\}$. To do so, we draw balls labeled $1,\dots,k$ uniformly at
random, with replacement, $d$ times. If a ball with label $i$ is drawn at the $r$-th draw, then we store the integer $r$
in $A_i$. Hence after $d$ rounds we would have put each integer from $\{1,\dots,d\}$ into one of the sets $A_1,\dots,A_k$.
Note that some sets might be empty. On performing $m$ independent trials of this experiment, we obtain a
family $\mathcal{A}$ of $m$ partitions generated independently at random. It remains to be shown how big $m$ should be
such that $\mathcal{A}$ satisfies the partition assumption.
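The ball-drawing experiment can be simulated directly. The sketch below is illustrative only: it assumes 0-based indices, a label-vector encoding of a partition, a natural logarithm in the family size $m = \lceil 2ke^k \log d \rceil$, and hypothetical function names. The brute-force check enumerates all $k$-subsets, so it is only practical for small $d$ and $k$.

```python
import math
import random
from itertools import combinations


def random_partition(d, k, rng):
    """Draw a label in {0, ..., k-1} for each coordinate, uniformly and independently."""
    return tuple(rng.randrange(k) for _ in range(d))


def separates(labels, subset):
    """True if the coordinates in `subset` fall into pairwise distinct sets."""
    return len({labels[j] for j in subset}) == len(subset)


def satisfies_partition_assumption(family, d, k):
    """True if every k-subset of coordinates is separated by some partition in the family."""
    return all(any(separates(labels, s) for labels in family)
               for s in combinations(range(d), k))


rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
d, k = 8, 2
m = math.ceil(2 * k * math.exp(k) * math.log(d))
family = [random_partition(d, k, rng) for _ in range(m)]
ok = satisfies_partition_assumption(family, d, k)
```

For these values a fixed pair of coordinates is separated by one random partition with probability $1/2$, so a family of this size fails only with negligible probability.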

To see this, consider a fixed $k$-tuple $a = (a_1,\dots,a_k) \in \binom{[d]}{k}$. The probability that a random partition $A$
separates the components of $a$ into distinct sets $A_1,\dots,A_k$ is $p = \frac{k!}{k^k}$. Hence the probability that none of
the $m$ partitions separates $a$ is $(1-p)^m$. Since we have $\binom{d}{k}$ tuples, we thus have from the union bound that
each $k$-tuple $a \in \binom{[d]}{k}$ is separated by a partition with probability at least $1 - \binom{d}{k}(1-p)^m$. Therefore to have
a non-zero probability of success, $m$ should be large enough to ensure that $\binom{d}{k}(1-p)^m < 1$. This is derived
by making use of the following series of inequalities:

where we use Stirling's approximation in the first inequality and the fact that $(1 - x^{-1})^x \le e^{-1}$ when
$x > 1$ in the third inequality. It follows that if $m \ge 2ke^k \log d$, then $\mathcal{A}$ meets the partition
assumption with probability at least $1 - d^{-k}$.
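One chain of inequalities that is consistent with this description can be sketched as follows. The particular intermediate estimates ($k! \ge (k/e)^k$ and $\binom{d}{k} \le d^k$) are assumptions made for this sketch and need not coincide with the authors' exact display.

```latex
\begin{align*}
\binom{d}{k}(1-p)^m
  &\le \binom{d}{k}\left(1 - e^{-k}\right)^m
     && \text{Stirling: } k! \ge (k/e)^k \text{, hence } p = k!/k^k \ge e^{-k} \\
  &\le d^k \left(1 - e^{-k}\right)^m
     && \binom{d}{k} \le d^k \\
  &\le d^k \, e^{-m e^{-k}}
     && (1 - x^{-1})^x \le e^{-1} \text{ with } x = e^k \\
  &\le d^k \, d^{-2k} = d^{-k}
     && \text{whenever } m \ge 2 k e^k \log d .
\end{align*}
```

The same computation with $m = C k e^k \log d$ gives a failure probability of at most $d^{k} d^{-Ck} = d^{-k(C-1)}$.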

4.1. Stochastic setting for the general case, $k \ge 1$. In this setting, the reward functions $\{r_t\}_{t=1}^{n}$
are independent samples from some fixed probability distribution on functions $r : [0,1]^d \to \mathbb{R}$, where
$r(x_1,\dots,x_d) = g(x_{i_1},\dots,x_{i_k})$ and $(i_1,\dots,i_k) \in \mathcal{T}_k^d$. At each time $t$, the environment randomly samples $g_t$ from
some probability distribution on functions $g : [0,1]^k \to \mathbb{R}$. The set of $k$ active coordinates is chosen from
$\mathcal{T}_k^d$ once at the beginning of plays and kept fixed thereafter. The optimal constant strategy $x^* \in [0,1]^d$ is
defined as

$$x^* = \arg\max_{x \in [0,1]^d} E[r(x)] = \arg\max_{x \in [0,1]^d} E[g(x_{i_1},\dots,x_{i_k})].$$

The player discretizes the region $[0,1]^d$ by first constructing a family of partitions $\mathcal{A}$ using the probabilistic
method described earlier. For $|\mathcal{A}| = Cke^k \log d$, where $C > 2$ is a constant, $\mathcal{A}$ satisfies the
partition assumption with probability at least $1 - d^{-k(C-1)}$. After constructing $\mathcal{A}$, the player constructs
the set $\mathcal{P}_M$ as defined in (4.1), so that $|\mathcal{P}_M| = CM^k k e^k \log d$. Over a play of $T$ rounds, the regret $R(T)$ is
defined as follows.

$$R(T) := E\left[\sum_{t=1}^{T} g_t(x^*_{i_1},\dots,x^*_{i_k}) - g_t(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})\right] = E\left[\sum_{t=1}^{T} \bar{g}(x^*_{i_1},\dots,x^*_{i_k}) - \bar{g}(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})\right]$$

where $\bar{g} := E[g_t]$ denotes the mean reward function, $(x^{(t)}_1,\dots,x^{(t)}_d) \in \mathcal{P}_M$ is the strategy played at time $t$,
and the expectation in the rightmost term is over the random choice of $(i_1,\dots,i_k)$.

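Lemma 4. Given the above setting and assuming that:

(1) the k-tuple $(i_1,\dots,i_k) \in \mathcal{T}_k^d$ is fixed but unknown to the player across time and,

(2) the UCB-1 algorithm is used along with the strategy set $\mathcal{P}_M$,

we have for

$$M = \left\lceil \left( k^{\frac{\alpha-3}{2}} e^{-\frac{k}{2}} (\log d)^{-\frac{1}{2}} \sqrt{\frac{T}{\log T}} \right)^{\frac{2}{2\alpha+k}} \right\rceil$$

that

$$R(T) = O\left( T^{\frac{\alpha+k}{2\alpha+k}} (\log T)^{\frac{\alpha}{2\alpha+k}} k^{\frac{\alpha(k+6)}{2(2\alpha+k)}} e^{\frac{k\alpha}{2\alpha+k}} (\log d)^{\frac{\alpha}{2\alpha+k}} \right).$$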



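Proof. For some $x' \in \mathcal{P}_M$ we can split $R(T)$ into $R_1(T) + R_2(T)$ where:

$$R_1(T) = E\left[\sum_{t=1}^{T} \bar{g}(x^*_{i_1},\dots,x^*_{i_k}) - \bar{g}(x'_{i_1},\dots,x'_{i_k})\right],$$

$$R_2(T) = E\left[\sum_{t=1}^{T} \bar{g}(x'_{i_1},\dots,x'_{i_k}) - \bar{g}(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})\right].$$

Let us consider $R_1(T)$ first. For any $(i_1,\dots,i_k) \in \mathcal{T}_k^d$, $\exists x' \in \mathcal{P}_M$ with

$$x'_{i_1} = \frac{\alpha_1}{M}, \ \dots, \ x'_{i_k} = \frac{\alpha_k}{M}$$

where $\alpha_1,\dots,\alpha_k$ are such that $|\alpha_j/M - x^*_{i_j}| < 1/M$. This follows from the projection property of $\mathcal{A}$.
Therefore we can bound $R_1(T)$ as follows.

$$R_1(T) \le T L \left( \left(\frac{1}{M}\right)^{2} k \right)^{\alpha/2} = L \, T k^{\alpha/2} M^{-\alpha}.$$

In other words, $R_1(T) = O(T k^{\alpha/2} M^{-\alpha})$. We now proceed to bound $R_2(T)$. We first define

$$x^{(*)} := \arg\max_{x \in \mathcal{P}_M} \bar{g}(x_{i_1},\dots,x_{i_k}).$$

This gives us $R_2(T) \le \sum_{t=1}^{T} E[\bar{g}(x^{(*)}_{i_1},\dots,x^{(*)}_{i_k}) - \bar{g}(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})]$. So, the problem has now reduced to a
$|\mathcal{P}_M|$-armed stochastic bandit one, for which $x^{(*)} \in \mathcal{P}_M$ is the optimal strategy in hindsight amongst those in
$\mathcal{P}_M$. We thus employ the UCB-1 algorithm and play at each $t$ a strategy $x^{(t)} \in \mathcal{P}_M$. In particular, on account
of Assumption 2 it can be shown that $R_2(T) = O(\sqrt{|\mathcal{P}_M| T \log T})$ (see Theorem 3.1, [13]). Combining the
bounds for $R_1(T)$ and $R_2(T)$ and recalling that $|\mathcal{P}_M| = O(M^k k e^k \log d)$ we have that:

$$R(T) = O\left(T M^{-\alpha} k^{\alpha/2} + \sqrt{|\mathcal{P}_M| T \log T}\right) = O\left(T M^{-\alpha} k^{\alpha/2} + \sqrt{M^k k e^k \log d \; T \log T}\right). \qquad (4.2)$$

Plugging $M = \left\lceil \left( k^{\frac{\alpha-3}{2}} e^{-\frac{k}{2}} (\log d)^{-\frac{1}{2}} \sqrt{\frac{T}{\log T}} \right)^{\frac{2}{2\alpha+k}} \right\rceil$ in (4.2) we obtain the stated bound on $R(T)$.

□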
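As a check on the exponents, the substitution can be worked out term by term. The sketch below ignores the rounding of $M$ to an integer, which is an assumption made only to keep the algebra exact.

```latex
% Take M^{\frac{2\alpha+k}{2}} = k^{\frac{\alpha-3}{2}} e^{-\frac{k}{2}} (\log d)^{-\frac{1}{2}} (T/\log T)^{\frac{1}{2}}.
\begin{align*}
T M^{-\alpha} k^{\alpha/2}
  &= T^{\frac{\alpha+k}{2\alpha+k}} (\log T)^{\frac{\alpha}{2\alpha+k}}
     k^{\frac{\alpha}{2} - \frac{\alpha(\alpha-3)}{2\alpha+k}}
     e^{\frac{k\alpha}{2\alpha+k}} (\log d)^{\frac{\alpha}{2\alpha+k}},
  \qquad \frac{\alpha}{2} - \frac{\alpha(\alpha-3)}{2\alpha+k} = \frac{\alpha(k+6)}{2(2\alpha+k)}, \\
\sqrt{M^k k e^k \log d \; T \log T}
  &= T^{\frac{\alpha+k}{2\alpha+k}} (\log T)^{\frac{\alpha}{2\alpha+k}}
     k^{\frac{\alpha(k+2) - 2k}{2(2\alpha+k)}}
     e^{\frac{k\alpha}{2\alpha+k}} (\log d)^{\frac{\alpha}{2\alpha+k}} .
\end{align*}
```

Both terms share the same powers of $T$, $\log T$, $e$ and $\log d$. Since $\alpha(k+2) - 2k \le \alpha(k+6)$ and $k \ge 1$, the first term carries the larger power of $k$, which is the one appearing in the stated bound.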




Then using strategy set $\mathcal{P}_M$ with $M$ defined as above, we can employ Algorithm 2 with the MAB sub-routine
being the UCB-1 algorithm. Since regret in the inner loop is
$O\left( T^{\frac{\alpha+k}{2\alpha+k}} (\log T)^{\frac{\alpha}{2\alpha+k}} k^{\frac{\alpha(k+6)}{2(2\alpha+k)}} e^{\frac{k\alpha}{2\alpha+k}} (\log d)^{\frac{\alpha}{2\alpha+k}} \right)$,
we have that the overall regret is
$O\left( n^{\frac{\alpha+k}{2\alpha+k}} (\log n)^{\frac{\alpha}{2\alpha+k}} k^{\frac{\alpha(k+6)}{2(2\alpha+k)}} e^{\frac{k\alpha}{2\alpha+k}} (\log d)^{\frac{\alpha}{2\alpha+k}} \right)$
over $n$ plays with high probability, by using the doubling trick.

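Algorithm 2 Algorithm CAB(d, k)

T = 1
Construct family of partitions $\mathcal{A}$
while $T \le n$ do
    • $M = \left\lceil \left( k^{\frac{\alpha-3}{2}} e^{-\frac{k}{2}} (\log d)^{-\frac{1}{2}} \sqrt{\frac{T}{\log T}} \right)^{\frac{2}{2\alpha+k}} \right\rceil$
    • Create $\mathcal{P}_M$ using $\mathcal{A}$
    • Initialize MAB with $\mathcal{P}_M$
    for $t = T, \dots, \min(2T - 1, n)$ do
        • get $x_t$ from MAB
        • Play $x_t$ and get $r_t(x_t)$
        • Feed $r_t(x_t)$ back to MAB
    end for
    T = 2T
end while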
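A minimal Python sketch of the doubling loop of Algorithm 2 with a UCB-1 sub-routine is given below. It is not the authors' implementation. The names `choose_M`, `UCB1`, `cab` and `reward` are hypothetical, partitions are stored as 0-based label vectors, the index used is the standard UCB-1 rule (empirical mean plus $\sqrt{2 \ln t / n_i}$), and $\log T$ is clamped at 1 because the formula for $M$ is undefined at $T = 1$; that clamp is an assumption of this sketch.

```python
import math
import random
from itertools import product


def choose_M(T, k, d, alpha):
    """Grid resolution for a phase of length T (log T clamped at 1: sketch assumption)."""
    log_T = max(math.log(T), 1.0)
    base = (k ** ((alpha - 3) / 2) * math.exp(-k / 2)
            / math.sqrt(math.log(d)) * math.sqrt(T / log_T))
    return max(1, math.ceil(base ** (2 / (2 * alpha + k))))


def build_strategy_set(partitions, k, M):
    """Strategies of (4.1); labels[j] names the set that contains coordinate j."""
    return [tuple(a[i] / M for i in labels)
            for labels in partitions
            for a in product(range(1, M + 1), repeat=k)]


class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for arm, count in enumerate(self.counts):
            if count == 0:  # play every arm once before using the index
                return arm
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def cab(n, d, k, alpha, partitions, reward):
    """Doubling trick: restart the bandit with a finer grid whenever T doubles."""
    plays = []
    T = 1
    while T <= n:
        arms = build_strategy_set(partitions, k, choose_M(T, k, d, alpha))
        mab = UCB1(len(arms))
        for t in range(T, min(2 * T - 1, n) + 1):
            arm = mab.select()
            r = reward(t, arms[arm])
            mab.update(arm, r)
            plays.append((arms[arm], r))
        T *= 2
    return plays


rng = random.Random(0)
partitions = [(0, 1, 0, 1), (0, 0, 1, 1), (0, 1, 1, 0)]  # separates all pairs for d = 4


def reward(t, x):
    # Toy environment: hidden active coordinates (1, 3), Lipschitz g, bounded noise.
    g = 1.0 - (abs(x[1] - 0.5) + abs(x[3] - 1.0)) / 2
    return min(1.0, max(0.0, g + rng.uniform(-0.1, 0.1)))


plays = cab(n=2000, d=4, k=2, alpha=1.0, partitions=partitions, reward=reward)
```

Restarting the sub-routine at every doubling of $T$ discards past observations, which is what allows a grid size $M$ tuned to the current phase length without knowing $n$ in advance.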



4.2. Adversarial setting for the general case, $k \ge 1$. In this setting, the reward functions $\{r_t\}_{t=1}^{n}$ are
chosen beforehand by an oblivious adversary, i.e., the adversary does not select $r_t$ based on the past sequence of
plays: $(x_1, r_1(x_1), \dots, x_{t-1}, r_{t-1}(x_{t-1}))$. Hence for each $t$, the adversary chooses $g_t \in \mathcal{C}(\alpha, \delta, L, k)$ arbitrarily
but oblivious to the plays of the algorithm. The $k$ active coordinates $(i_1,\dots,i_k) \in \mathcal{T}_k^d$ are chosen once at
the beginning of plays and kept fixed thereafter. The optimal constant strategy $x^* \in [0,1]^d$ is defined as

$$x^* := \arg\max_{x \in [0,1]^d} \sum_{t=1}^{n} r_t(x) = \arg\max_{x \in [0,1]^d} \sum_{t=1}^{n} g_t(x_{i_1},\dots,x_{i_k}).$$

The player discretizes $[0,1]^d$ by constructing the set $\mathcal{P}_M$ using the probabilistic method described earlier
for some integer $M > 0$. The regret incurred by the player over $T$ rounds is defined as

$$R(T) = \sum_{t=1}^{T} E[r_t(x^*) - r_t(x_t)] = \sum_{t=1}^{T} E[g_t(x^*_{i_1},\dots,x^*_{i_k}) - g_t(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})]$$

where $x_t = (x^{(t)}_1,\dots,x^{(t)}_d) \in \mathcal{P}_M$ is the strategy played at time $t$. Note that the expectation here is over the
random choice of the player at each $t$; the adversary is assumed to be deterministic without loss of generality.
We now prove, in the following lemma, an upper bound on regret for this setting over a play of $T$ rounds.

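Lemma 5. Given the above setting and assuming that:

(1) the k-tuple $(i_1,\dots,i_k) \in \mathcal{T}_k^d$ is fixed but unknown to the player across time and,

(2) the Exp3 algorithm is used along with the strategy set $\mathcal{P}_M$,

we have for

$$M = \left\lceil \left( k^{\frac{\alpha-3}{2}} e^{-\frac{k}{2}} (\log d)^{-\frac{1}{2}} \sqrt{\frac{T}{\log T}} \right)^{\frac{2}{2\alpha+k}} \right\rceil$$

that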

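$$R(T) = O\left( T^{\frac{\alpha+k}{2\alpha+k}} (\log T)^{\frac{\alpha}{2\alpha+k}} k^{\frac{\alpha(k+6)}{2(2\alpha+k)}} e^{\frac{k\alpha}{2\alpha+k}} (\log d)^{\frac{\alpha}{2\alpha+k}} \right).$$

Proof. For some $x' \in \mathcal{P}_M$ we can split $R(T)$ into $R_1(T) + R_2(T)$ where

$$R_1(T) = E\left[\sum_{t=1}^{T} g_t(x^*_{i_1},\dots,x^*_{i_k}) - g_t(x'_{i_1},\dots,x'_{i_k})\right],$$

$$R_2(T) = E\left[\sum_{t=1}^{T} g_t(x'_{i_1},\dots,x'_{i_k}) - g_t(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})\right].$$

Let us consider $R_1(T)$ first. For any $(i_1,\dots,i_k) \in \mathcal{T}_k^d$, $\exists x' \in \mathcal{P}_M$ with

$$x'_{i_1} = \frac{\alpha_1}{M}, \ \dots, \ x'_{i_k} = \frac{\alpha_k}{M}$$

where $\alpha_1,\dots,\alpha_k$ are such that $|\alpha_j/M - x^*_{i_j}| < 1/M$. This follows from the projection property of $\mathcal{A}$.
Therefore by observing that

$$g_t(x^*_{i_1},\dots,x^*_{i_k}) - g_t(x'_{i_1},\dots,x'_{i_k}) \le L \left\| (x^*_{i_1},\dots,x^*_{i_k}) - (x'_{i_1},\dots,x'_{i_k}) \right\|_2^{\alpha} < L \left( \left(\frac{1}{M}\right)^{2} k \right)^{\alpha/2}$$

we have that $R_1(T) = O(T k^{\alpha/2} M^{-\alpha})$. We now proceed to bound $R_2(T)$. We first define

$$x^{(*)} := \arg\max_{x \in \mathcal{P}_M} \sum_{t=1}^{T} g_t(x_{i_1},\dots,x_{i_k}).$$

This gives us $R_2(T) \le \sum_{t=1}^{T} E[g_t(x^{(*)}_{i_1},\dots,x^{(*)}_{i_k}) - g_t(x^{(t)}_{i_1},\dots,x^{(t)}_{i_k})]$. So, the problem has now reduced to
a $|\mathcal{P}_M|$-armed adversarial bandit one, for which $x^{(*)} \in \mathcal{P}_M$ is the optimal strategy in hindsight amongst
those in $\mathcal{P}_M$. By employing the Exp3 algorithm and playing at each $t$ a strategy $x_t \in \mathcal{P}_M$, we thus obtain
(see 0) $R_2(T) = O(\sqrt{|\mathcal{P}_M| T \log |\mathcal{P}_M|})$. Combining the bounds for $R_1(T)$ and $R_2(T)$ and recalling that
$|\mathcal{P}_M| = O(M^k k e^k \log d)$ we have that:

$$R(T) = O\left(T M^{-\alpha} k^{\alpha/2} + \sqrt{|\mathcal{P}_M| T \log |\mathcal{P}_M|}\right) = O\left(T M^{-\alpha} k^{\alpha/2} + \sqrt{M^k k e^k \log d \; T \log(M^k k e^k \log d)}\right). \qquad (4.3)$$

Finally, it is verifiable that plugging the choice $M = \left\lceil \left( k^{\frac{\alpha-3}{2}} e^{-\frac{k}{2}} (\log d)^{-\frac{1}{2}} \sqrt{\frac{T}{\log T}} \right)^{\frac{2}{2\alpha+k}} \right\rceil$ in (4.3) we obtain
the stated bound on $R(T)$.

□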



Using strategy set $\mathcal{P}_M$ with $M$ defined as above, we can employ Algorithm 2 with the MAB sub-routine
now being the Exp3 algorithm. As for the stochastic setting, we have that the overall regret over $n$ plays is
$O\left( n^{\frac{\alpha+k}{2\alpha+k}} (\log n)^{\frac{\alpha}{2\alpha+k}} k^{\frac{\alpha(k+6)}{2(2\alpha+k)}} e^{\frac{k\alpha}{2\alpha+k}} (\log d)^{\frac{\alpha}{2\alpha+k}} \right)$ with high probability.
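For the adversarial setting, the MAB sub-routine could look like the following sketch of the textbook Exp3 update: exponential weights mixed with uniform exploration and importance-weighted reward estimates. The class name, its `select`/`update` interface and the exploration parameter `gamma` are assumptions of this sketch, since the text does not specify how Exp3 is tuned. Rewards are assumed to lie in $[0,1]$.

```python
import math
import random


class Exp3:
    def __init__(self, n_arms, gamma, rng=None):
        self.K = n_arms
        self.gamma = gamma  # exploration rate in (0, 1]; its tuning is left open here
        self.weights = [1.0] * n_arms
        self.rng = rng or random.Random()

    def probabilities(self):
        """Mix the normalized weights with the uniform distribution."""
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.K
                for w in self.weights]

    def select(self):
        """Sample an arm index according to the current distribution."""
        return self.rng.choices(range(self.K), weights=self.probabilities())[0]

    def update(self, arm, reward):
        """Importance-weighted estimate: only the played arm's weight changes."""
        estimate = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.K)


mab = Exp3(n_arms=3, gamma=0.1, rng=random.Random(0))
mab.update(0, 1.0)  # pretend arm 0 was played and returned reward 1
probs = mab.probabilities()
```

After the single update above, arm 0 has the largest selection probability and the other two arms remain tied. Over very long runs the weights can grow large, so a practical version would renormalize them periodically.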

5. CONCLUDING REMARKS 

In this work we considered continuum armed bandit problems for the stochastic and adversarial settings
where the reward function $r : [0,1]^d \to \mathbb{R}$ is assumed to depend on $k$ out of the $d$ coordinate variables, implying
$r(x_1,\dots,x_d) = g(x_{i_1},\dots,x_{i_k})$ at each round. Assuming $g$ to satisfy a local Hölder continuity condition and
the $k$-tuple $(i_1,\dots,i_k)$ to be fixed but unknown across time, we proposed a simple modification of the CAB1
algorithm based on a probabilistic construction of the discrete set of points. We proved that it achieves a
regret that scales sub-logarithmically with dimension $d$, with the rate of growth of regret depending only on
$k$ and the exponent $\alpha \in (0,1]$.

We note that the upper bound on regret proven in our paper is $O(n^{\frac{\alpha+k}{2\alpha+k}} (\log n)^{\frac{\alpha}{2\alpha+k}} C(k,d))$ when $(i_1,\dots,i_k)$
is fixed across time, where $C(k,d)$ depends polynomially on $k$ and sub-logarithmically on $d$. It would be
interesting to derive lower bounds in this setting to find the optimal scaling of regret with dimension $d$. We
also employ a probabilistic construction of the discrete set of points lying in $[0,1]^d$, resulting
in a high probability regret bound. It would be interesting to look into deterministic constructions of such
point sets. Lastly, although we considered the setting where the $k$ active coordinates $(i_1,\dots,i_k)$ remain fixed,
it is an interesting question to consider the setting where $(i_1,\dots,i_k)$ are also allowed to change over time.
We leave this for future work.